# Topic 5

#### **Single Cycle Processor**

### Introduction

- Starting from this topic, we will examine two RISC-V implementations
  - A simplified version
  - A more realistic pipelined version
- Supporting a simple subset, showing most aspects of RISC-V design
  - Memory reference: 1w, sw
  - Arithmetic/logical: add, sub, and, or
  - Conditional branch: beq
- You are expected to be able to change the hardware to support more instructions

#### Instruction Execution

- PC → program memory, fetch instruction
- ② Use register addresses to access register file, read registers
- 3 Depending on instructions
  - Use ALU/Adder to calculate
    - Arithmetic/logic result
    - Memory address for load/store
    - Branch target address
- 4 Access register file or data memory for load or store or save calculation results
- ⑤ Update PC ← target address or PC + 4

### **Building a Datapath**

- Datapath
  - Elements that process, store, or route data in the CPU
  - E.g. registers, ALUs, mux's, memories, ...
- We will build a RISC-V datapath incrementally
- We will add a control unit to provide control signals to the datapath

#### **Instruction Fetch**



#### **R-Format Instructions**

- Operations: add rd, rs1, rs2
  - Read two register operands
  - Perform arithmetic/logical operation
  - Write register result
- We need:



a. Registers b. ALU

# Register Files



- Data is read out whenever register numbers are provided
- Writes
  - Edge-triggered
  - All the write inputs (Write Data, Write Register, RegWrite) must be valid before the clock edge – setup time
- Can read and write the same register within a clock cycle: write first, followed by read.

#### **Load/Store Instructions**

- Operations: lw rd, imm12(rs1) sw rs2, imm12(rs1)
  - Read register operands
  - Calculate address using 12-bit immediate
  - Load: Read memory and update register
  - Store: Write register value to memory
- Additionally, we need:



### **Beq Instructions**

- Operations: beq rs1, rs2, target
  - Read register operands
  - Compare operands
    - $rs1==rs2? \Rightarrow rs1-rs2==0?$
    - Use ALU, subtract then check the Zero output of the ALU
  - Calculate target address

```
Target PC = Current PC + immediate \times 2
```

- Sign-extend the immediate
- Shift left 1 place (multiply by 2)
- Add to PC value

#### **Branch Instructions**



| Туре   | Field          |        |        |               |            |        |  |  |  |  |
|--------|----------------|--------|--------|---------------|------------|--------|--|--|--|--|
|        | 7 bits         | 5 bits | 5 bits | 3 bits        | 5 bits     | 7 bits |  |  |  |  |
| I-type | immediate[11:0 | rs1    | funct3 | rd            | opcode     |        |  |  |  |  |
| S-type | immed[11:5]    | rs2    | rs1    | funct3        | immed[4:0] | opcode |  |  |  |  |
| B-type | immed[11,9:4]  | rs1    | funct3 | immed[3:0,10] | opcode     |        |  |  |  |  |
| U-type | immed          | rd     | opcode |               |            |        |  |  |  |  |
| J-type | immediate[     | rd     | opcode |               |            |        |  |  |  |  |

- In B-type (SB-type) and J-type (UJ-type), immediate bits are swirled around
  - It saves hardware (muxes) in "Imm Gen" component which picks different fields in an instruction to assemble the immediate number according to different type of the instruction

# Composing the Elements

- First version of datapath executes an instruction in one single clock cycle
  - Each datapath element can only do one function at a time
    - Hence, we need separate instruction memory and data memory
- Use multiplexers where alternate data sources are used for different instructions

# **Combine The Operations**



# **Add Multiplexers**



# **Add Control Signals**



# **Putting Everything Together**



- RISC-V has more types (6) then MIPS (4)
  - But it saves hardware on the critical path





MIPS Architecture: simpler instruction format, but more complicated hardware, longer delay

RISC-V Architecture: more complicated instruction format, but simpler hardware, thus better performance

#### **Add ALU Control**

- ALU used for
  - Load/Store: Function = add
  - Branch: Function = subtract
  - R-type: Function depends on opcode

| ALU control Signal | Function |  |  |
|--------------------|----------|--|--|
| 0000               | AND      |  |  |
| 0001               | OR       |  |  |
| 0010               | add      |  |  |
| 0110               | subtract |  |  |

#### **ALU Control Unit**

- ALU Control is a combinational logic
  - Assume 2-bit ALUOp derived from opcode



| Instr. | Operation       | ALU function | Opcode<br>field | ALUOp<br>(input) | i[30]<br>(input) | funct3<br>(input) | ALU control (output) |
|--------|-----------------|--------------|-----------------|------------------|------------------|-------------------|----------------------|
| lw     | load register   | add          | XXXXXX          | 00               | X                | 010               | 0010                 |
| sw     | store register  | add          | XXXXXX          | 00               | X                | 010               | 0010                 |
| beq    | branch on equal | subtract     | XXXXXX          | 01               | X                | 000               | 0110                 |
| R-     | add             | add          | 100000          | 10               | 0                | 000               | 0010                 |
| type   | subtract        | subtract     | 100010          |                  | 1                | 000               | 0110                 |
|        | AND             | AND          | 100100          |                  | 0                | 111               | 0000                 |
|        | OR              | OR           | 100101          |                  | 0                | 110               | 0001                 |

### **Datapath With Control Unit**



### **R-Type Instruction**



#### **Load Instruction**



### **Branch-on-Equal Instruction**



# **Control Signals**

Control signals derived from instruction

| Inst. | ALU<br>Src | Mem<br>Write | Mem<br>Read | Branch | Memto<br>Reg | Reg<br>Write |
|-------|------------|--------------|-------------|--------|--------------|--------------|
| add   |            |              |             |        |              |              |
| addi  |            |              |             |        |              |              |
| lw    |            |              |             |        |              |              |
| SW    |            |              |             |        |              |              |
| beq   |            |              |             |        |              |              |

#### **Performance Related Factors**

- Algorithm
  - Determines number of operations executed
- Programming language, compiler, architecture
  - Determine number of machine instructions (lines of source code) executed per operation
- Processor and memory system
  - Determine how fast instructions are executed
- I/O system (including OS)
  - Determines how fast I/O operations are executed

#### **How to Measure Computer Performance?**

- Execution (response) time
  - How long it takes to do a task
  - Measure for PCs and embedded computers
- Throughput
  - Total work done per unit time
  - Servers

We'll focus on execution time for now

#### **Execution Time**

- Elapsed time for System Performance
  - Total execution time to complete a task, including
    - Processing, I/O, OS overhead, idle time, everything
  - But doesn't completely reflect computer's performance if it focuses on better throughput
- CPU time for CPU Performance
  - CPU execution time processing a task
    - Exclude I/O time, time spent for other tasks
  - Comprises user CPU time and system CPU time
    - Hard to differentiate
  - Different programs run with different CPU performance and system performance

### **CPU Clocking**

 Operation of digital hardware governed by a constant-rate clock



- Clock period: duration of a clock cycle
  - e.g.,  $250ps = 0.25ns = 250 \times 10^{-12}s$
- Clock frequency (rate): cycles per second
  - e.g.,  $4.0GHz = 4000MHz = 4.0 \times 10^9Hz$

#### **CPU Time**

CPU Time = CPU Clock Cycles × Clock Cycle Time

= CPU Clock Cycles

Clock Rate

- Performance improved by
  - Reducing number of clock cycles
  - Increasing clock rate
  - Hardware designer must often trade off clock rate against cycle count

## **CPU Time Example**

- Computer A: 2GHz clock, 10s CPU time
- Designing Computer B
  - Aim for 6s CPU time
  - Can do faster clock, but causes 1.2 × clock cycles
- How fast must Computer B clock be?

$$\begin{aligned} \text{Clock Rate}_{\text{B}} &= \frac{\text{Clock Cycles}_{\text{B}}}{\text{CPU Time}_{\text{B}}} = \frac{1.2 \times \text{Clock Cycles}_{\text{A}}}{6\text{s}} \\ \text{Clock Cycles}_{\text{A}} &= \text{CPU Time}_{\text{A}} \times \text{Clock Rate}_{\text{A}} \\ &= 10\text{s} \times 2\text{GHz} = 20 \times 10^9 \\ \text{Clock Rate}_{\text{B}} &= \frac{1.2 \times 20 \times 10^9}{6\text{s}} = \frac{24 \times 10^9}{6\text{s}} = 4\text{GHz} \end{aligned}$$

#### Instruction Count and CPI

Clock Cycles = Instruction Count  $\times$  Cycles per Instruction

CPU Time = Instruction Count  $\times$  CPI  $\times$  Clock Cycle Time  $= \frac{Instruction Count \times CPI}{Clock Rate}$ 

- Instruction Count (IC) for a program
  - Determined by program, ISA and compiler
- Average cycles per instruction (CPI)
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix

### **CPI Example**

- Computer A: Cycle Time = 250ps, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same ISA
- Which is faster, and by how much?

$$\begin{aligned} \text{CPU Time}_{A} &= \text{Instruction Count} \times \text{CPI}_{A} \times \text{Cycle Time}_{A} \\ &= I \times 2.0 \times 250 \text{ps} = I \times 500 \text{ps} & \quad \text{A is faster...} \end{aligned}$$
 
$$\begin{aligned} \text{CPU Time}_{B} &= \text{Instruction Count} \times \text{CPI}_{B} \times \text{Cycle Time}_{B} \\ &= I \times 1.2 \times 500 \text{ps} = I \times 600 \text{ps} \end{aligned}$$
 
$$\begin{aligned} &\frac{\text{CPU Time}_{B}}{\text{CPU Time}_{A}} &= \frac{I \times 600 \text{ps}}{I \times 500 \text{ps}} = 1.2 & \quad \text{...by this much} \end{aligned}$$

### **Performance Summary**

#### **The BIG Picture**

$$CPU Time = \frac{Instructions}{Program} \times \frac{Clock \ cycles}{Instruction} \times \frac{Seconds}{Clock \ cycle}$$

Performance depends on

- **Clock Period**
- Algorithm: affects IC, possibly CPI
- Programming language: affects IC, CPI
- Compiler: affects IC, CPI
- Instruction set architecture: affects IC, CPI, Tc

# Single Cycle Implementation



#### **State Machine**

Finite State Machine



# **Clocking Methodology**

- Combinational logic does the computation during clock cycles
  - Between clock edges
  - Input (present state) from state elements, output (next state) to state element
  - Among all kinds of computations, longest delay determines clock period



- Longest instruction determines the clock cycle time
  - Critical path: the path having longest delay
    - load instruction
       Instruction memory → register file → ALU → data memory → register file (plus MUXes)
- Not feasible to vary period for different instructions
  - Waste time on other instructions
- How to improve the performance (faster speed)?

- Assume times for major components are
  - 100ps for register read or write
  - 200ps for accessing memory
  - 200ps for ALU operations

| Instruction | Instr<br>fetch | Register read | ALU op | Data<br>Memory | Register write | Total<br>time |
|-------------|----------------|---------------|--------|----------------|----------------|---------------|
| lw          | 200ps          | 100 ps        | 200ps  | 200ps          | 100 ps         | 800ps         |
| SW          | 200ps          | 100 ps        | 200ps  | 200ps          |                | 700ps         |
| R-format    | 200ps          | 100 ps        | 200ps  |                | 100 ps         | 600ps         |
| beq         | 200ps          | 100 ps        | 200ps  |                |                | 500ps         |

- Assume 100 instructions are executed
  - 15% are loads
  - 15% are stores
  - 40% are R format instructions
  - 30% are branches
- What's the clock cycle time for single-cycle processor?
- Execution time using single-cycle processor?